Egyptian Arabic
Iterative Layer Pruning for Efficient Translation Inference
Moslem, Yasmin, Farouq, Muhammad Hazim Al, Kelleher, John D.
Large language models (LLMs) have transformed many areas of natural language processing, including machine translation. However, efficient deployment of LLMs remains challenging due to their intensive computational requirements. In this paper, we address this challenge and present our submissions to the Model Compression track at the Conference on Machine Translation (WMT 2025). In our experiments, we investigate iterative layer pruning guided by layer importance analysis. We evaluate this method using the Aya-Expanse-8B model for translation from Czech to German, and from English to Egyptian Arabic. Our approach achieves substantial reductions in model size and inference time, while maintaining the translation quality of the baseline models.
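The abstract's core idea — iteratively removing the layers that matter least — can be sketched in a few lines. This is a toy, self-contained illustration, not the authors' actual procedure: the model here is a list of scalar functions, and the importance score (average input/output change on calibration data) is one plausible metric, assumed for demonstration, rather than the analysis used with Aya-Expanse-8B.

```python
# Toy sketch of iterative layer pruning guided by layer-importance analysis.
# Layers and the importance metric are hypothetical stand-ins.

def layer_importance(layer, calib_inputs):
    """Score a layer by how much it changes its input on calibration data.
    A layer whose output barely differs from its input is a pruning candidate."""
    change = 0.0
    for x in calib_inputs:
        change += abs(layer(x) - x)
    return change / len(calib_inputs)

def iterative_prune(layers, calib_inputs, rounds):
    """Repeatedly drop the least important remaining layer."""
    layers = list(layers)
    for _ in range(rounds):
        scores = [layer_importance(l, calib_inputs) for l in layers]
        del layers[scores.index(min(scores))]
    return layers

# Toy "model": three layers; the middle one is near-identity.
model = [lambda x: x * 2.0, lambda x: x + 0.001, lambda x: x - 1.0]
pruned = iterative_prune(model, calib_inputs=[1.0, 2.0, 3.0], rounds=1)
print(len(pruned))  # 2 layers remain; the near-identity layer was dropped
```

In a real LLM, the same loop would operate on transformer blocks, re-scoring after each removal so that later pruning decisions account for earlier ones.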
ArzEn-MultiGenre: An aligned parallel dataset of Egyptian Arabic song lyrics, novels, and subtitles, with English translations
This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/). R. Al-Sabbagh, Data in Brief 54 (2024) 110271.
Subject: Computer Science, Social Sciences
Specific subject area: Natural Language Processing, machine translation, large language models, translation studies, cross-linguistic analysis, lexical semantics
Data format: Translated and aligned
Type of data: Texts (bilingual tables in Microsoft Excel files)
Data collection: The ArzEn-MultiGenre dataset consists of three genres: song lyrics, novels, and subtitles. The data was gathered from various sources using different methods. A website was crawled for song lyrics using an in-house web crawler, and professional translators manually translated the lyrics into English. For novels, hard copies were collected in English and Egyptian Arabic, then scanned and converted into text files using an optical character recognizer (OCR). The OCR output was then manually reviewed and aligned.
Nile-Chat: Egyptian Language Models for Arabic and Latin Scripts
Shang, Guokan, Abdine, Hadi, Chamma, Ahmad, Mohamed, Amr, Anwar, Mohamed, Bounhar, Abdelaziz, Herraoui, Omar El, Nakov, Preslav, Vazirgiannis, Michalis, Xing, Eric
We introduce Nile-Chat-4B, 3x4B-A6B, and 12B, a collection of LLMs for the Egyptian dialect, uniquely designed to understand and generate texts written in both Arabic and Latin scripts. Specifically, with Nile-Chat-3x4B-A6B, we introduce a novel language adaptation approach by leveraging the Branch-Train-MiX strategy to merge script-specialized experts into a single MoE model. Our Nile-Chat models significantly outperform leading multilingual and Arabic LLMs, such as LLaMa, Jais, and ALLaM, on our newly introduced Egyptian evaluation benchmarks, which span both understanding and generative tasks. Notably, our 12B model yields a 14.4% performance gain over Qwen2.5-14B-Instruct on Latin-script benchmarks. All our resources are publicly available. We believe this work presents a comprehensive methodology for adapting LLMs to dual-script languages, addressing an often overlooked aspect in modern LLM development.
Mufu: Multilingual Fused Learning for Low-Resource Translation with LLM
Lim, Zheng Wei, Gupta, Nitish, Yu, Honglin, Cohn, Trevor
Multilingual large language models (LLMs) are great translators, but this is largely limited to high-resource languages. For many LLMs, translating in and out of low-resource languages remains a challenging task. To maximize data efficiency in this low-resource setting, we introduce Mufu, which includes a selection of automatically generated multilingual candidates and an instruction to correct inaccurate translations in the prompt. Mufu prompts turn a translation task into a postediting one, and seek to harness the LLM's reasoning capability with auxiliary translation candidates, from which the model is required to assess the input quality, align the semantics cross-lingually, copy from relevant inputs, and override instances that are incorrect. Our experiments on En-XX translations over the Flores-200 dataset show that LLMs finetuned on Mufu-style prompts are robust to poor-quality auxiliary translation candidates, achieving performance superior to the NLLB 1.3B distilled model in 64% of low- and very-low-resource language pairs. We then distill these models to reduce inference cost, while maintaining on average a 3.1 chrF improvement over the finetune-only baseline in low-resource translations. This performance gap is caused primarily by scant pre-training data in these languages (Wei et al., 2023; Yuan et al., 2024; Alves et al., 2024), and is difficult to overcome despite growing efforts to support translations of long-tail languages (Kudugunta et al., 2024; Bapna et al., 2022; Lu et al., 2024).
In this work, we introduce multilingual fused learning (Mufu), which combines multilingual context and a postediting task when translating into lower-resource languages using LLMs. Mufu-style prompts (see Table 1, top block) include several multilingual translation candidates along with a postediting target, from which a model learns "in-context" to translate from languages with which the target language is more closely aligned due to cultural relevance, geographical and genealogical proximity. We rely on a larger, more competent multilingual teacher model to generate auxiliary translations in these languages, which help disambiguate inputs and improve cross-lingual semantic alignment in a translation task.
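The prompt structure described above — auxiliary candidates plus a correction instruction — can be sketched as follows. This is a hypothetical template for illustration; the exact wording and layout used by the authors (their Table 1) may differ, and the candidate strings here are placeholders.

```python
# Hypothetical sketch of assembling a Mufu-style prompt: several auxiliary
# multilingual translation candidates plus a post-editing instruction.

def mufu_prompt(source, src_lang, tgt_lang, candidates):
    """candidates: dict mapping language name -> auxiliary translation,
    typically produced by a larger multilingual teacher model."""
    lines = [f"Source ({src_lang}): {source}"]
    for lang, text in candidates.items():
        lines.append(f"Candidate ({lang}): {text}")
    lines.append(
        "Some candidates may be inaccurate. Using the source and the "
        f"candidates, write a corrected translation in {tgt_lang}:"
    )
    return "\n".join(lines)

prompt = mufu_prompt(
    "The river flooded the valley.",
    "English", "Egyptian Arabic",
    {
        "Modern Standard Arabic": "<teacher translation 1>",
        "French": "<teacher translation 2>",
    },
)
print(prompt)
```

Finetuning against prompts of this shape is what turns the translation task into a postediting one: the model must weigh the source against possibly unreliable candidates rather than translate from scratch.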
ArzEn-LLM: Code-Switched Egyptian Arabic-English Translation and Speech Recognition Using LLMs
Heakl, Ahmed, Zaghloul, Youssef, Ali, Mennatullah, Hossam, Rania, Gomaa, Walid
Motivated by the widespread increase in the phenomenon of code-switching between Egyptian Arabic and English in recent times, this paper explores the intricacies of machine translation (MT) and automatic speech recognition (ASR) systems, focusing on translating code-switched Egyptian Arabic-English to either English or Egyptian Arabic. Our goal is to present the methodologies employed in developing these systems, utilizing large language models such as LLama and Gemma. In the field of ASR, we explore the utilization of the Whisper model for code-switched Egyptian Arabic recognition, detailing our experimental procedures including data preprocessing and training techniques. Through the implementation of a consecutive speech-to-text translation system that integrates ASR with MT, we aim to overcome challenges posed by limited resources and the unique characteristics of the Egyptian Arabic dialect. Evaluation against established metrics showcases promising results, with our methodologies yielding a significant improvement of 56% in English translation over the state of the art and 9.3% in Arabic translation. Since code-switching is deeply inherent in spoken languages, it is crucial that ASR systems can effectively handle this phenomenon, as it enables seamless interaction in various domains, including business negotiations, cultural exchanges, and academic discourse. Our models and code are available as open-source resources. Code: http://github.com/ahmedheakl/arazn-llm, Models: http://huggingface.co/collections/ahmedheakl/arazn-llm-662ceaf12777656607b9524e.
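The "consecutive speech-to-text translation system" the abstract mentions is a cascade: an ASR step feeds its transcript to an MT step. The stub below illustrates only that wiring; the recognize/translate functions are hypothetical stand-ins for Whisper and an LLM translator, with a hard-coded toy transcript, not the authors' models.

```python
# Minimal sketch of a cascaded (consecutive) speech-to-text translation
# pipeline: audio -> ASR transcript -> MT translation. Both stages are
# hypothetical stand-ins for the real Whisper / LLM components.

def recognize(audio):
    # stand-in for Whisper transcription of code-switched speech
    return "ana kont fi el meeting imbareh"

def translate(text, target="English"):
    # stand-in for an LLM-based translation step (e.g., LLama or Gemma)
    table = {"ana kont fi el meeting imbareh": "I was in the meeting yesterday."}
    return table.get(text, text)

def speech_to_translation(audio, target="English"):
    """Cascade: speech -> transcript -> translation."""
    return translate(recognize(audio), target)

print(speech_to_translation(b"<audio bytes>"))
```

A practical consequence of this design, implied by the abstract, is that ASR errors on code-switched spans propagate into the MT stage, which is why robust code-switched recognition is emphasized.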
DIALECTBENCH: A NLP Benchmark for Dialects, Varieties, and Closely-Related Languages
Faisal, Fahim, Ahia, Orevaoghene, Srivastava, Aarohi, Ahuja, Kabir, Chiang, David, Tsvetkov, Yulia, Anastasopoulos, Antonios
Language technologies should be judged on their usefulness in real-world use cases. An often overlooked aspect in natural language processing (NLP) research and evaluation is language variation in the form of non-standard dialects or language varieties (hereafter, varieties). Most NLP benchmarks are limited to standard language varieties. To fill this gap, we propose DIALECTBENCH, the first-ever large-scale benchmark for NLP on varieties, which aggregates an extensive set of task-varied variety datasets (10 text-level tasks covering 281 varieties). This allows for a comprehensive evaluation of NLP system performance on different language varieties. We provide substantial evidence of performance disparities between standard and non-standard language varieties, and we also identify language clusters with large performance divergence across tasks. We believe DIALECTBENCH provides a comprehensive view of the current state of NLP for language varieties and one step towards advancing it further. Code/data: https://github.com/ffaisal93/DialectBench
ArzEn-ST: A Three-way Speech Translation Corpus for Code-Switched Egyptian Arabic - English
Hamed, Injy, Habash, Nizar, Abdennadher, Slim, Vu, Ngoc Thang
We present our work on collecting ArzEn-ST, a code-switched Egyptian Arabic - English Speech Translation Corpus. This corpus is an extension of the ArzEn speech corpus, which was collected through informal interviews with bilingual speakers. In this work, we collect translations in both directions, monolingual Egyptian Arabic and monolingual English, forming a three-way speech translation corpus. We make the translation guidelines and corpus publicly available. We also report results for baseline systems for machine translation and speech translation tasks. We believe this is a valuable resource that can motivate and facilitate further research studying the code-switching phenomenon from a linguistic perspective and can be used to train and evaluate NLP systems.
Learning Arabic from Egypt's Revolution
When you move to another country as an adult, the language flows around you like a river. Perhaps a child can immediately abandon himself to the current, but most older people will begin by picking out the words and phrases that seem to matter most, which is what I did after my family moved to Cairo, in October of 2011. It was the first fall after the Arab Spring; Hosni Mubarak, the former President, had been forced to resign the previous February. Every weekday, my wife, Leslie, and I met with a tutor for two hours at a language school called Kalimat, where we studied Egyptian Arabic. At the end of each session, we made a vocabulary list. In early December, following the first round of the nation's parliamentary elections, which had been dominated by the Muslim Brotherhood, my language notebook read:
On many days, I went to Tahrir Square, to report on the ongoing revolution. If I heard unfamiliar words or phrases, I brought them back to class. The following month, I learned "tear gas," "slaughter," and "Can you speak more slowly?" "Conspiracy theory" appeared in my notebook on the same day as "fried potatoes." Sometimes I wondered about the strangeness of Tahrir-speak, and what my Arabic would have been like if I had arrived ten years earlier. But it would have been different at any time, in any place: you can never step into the same language twice.
Even eternal phrases took on a new texture in the light of the revolution. After I could understand some of the radio talk shows that cabbies played, I realized that callers and hosts exchanged Islamic greetings for a full half minute before settling down to heated arguments about the new regime. Our textbook was entitled "Dardasha"--"Chatter"--and it outlined set conversations that I soon carried out with neighbors, using phrases that would never be touched by Tahrir: "May peace, mercy, and the blessings of God be upon you."
One of our teachers, Rifaat Amin, prepared a five-page handout entitled "Arabic Expressions of Social Etiquette." This supplemented "Dardasha," which also featured some lessons about social traditions, including the evil eye, the belief that envy can cause misfortune. In "Dardasha," icons of little bombs with burning fuses had been printed next to the kind of phrase that, even during a revolution, qualified as explosive: "Your son is really smart, Madame Fathiya."